6. Sentiment Topic Modeling: BERT (Bidirectional Encoder Representations from Transformers)¶

In [1]:
# pip install bertopic
# pip install sentence-transformers[cpu]
# pip install matplotlib plotly
In [2]:
import os
import time
import math
import re
import sys
import requests
import multiprocessing
from pandarallel import pandarallel  
from google.cloud import storage

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

from bertopic import BERTopic
from wordcloud import WordCloud
import nltk
import ast

import warnings

# Suppress all warnings. simplefilter('ignore') overrides an earlier 'once'
# filter and already covers the category-specific filters, so one call suffices.
warnings.simplefilter('ignore')
In [3]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 500)
In [4]:
num_processors = multiprocessing.cpu_count()

# Leave one core free for the main process, but never drop below one worker
workers = max(1, num_processors - 1)

print(f'Using {workers} workers')
Using 15 workers
In [5]:
pandarallel.initialize(nb_workers=workers, use_memory_fs=False, progress_bar=True)
INFO: Pandarallel will run on 15 workers.
INFO: Pandarallel will use standard multiprocessing data transfer (pipe) to transfer data between the main process and workers.

1. Import Data¶

In [6]:
%%time

file_path = 'news_vader_sent.parquet'
news = pd.read_parquet(file_path)
CPU times: user 19.5 s, sys: 12.8 s, total: 32.3 s
Wall time: 29 s
In [7]:
news.shape  # (198064, 18)
Out[7]:
(198064, 18)
In [8]:
news.columns
Out[8]:
Index(['url', 'date', 'language', 'title', 'text', 'year', 'month', 'day',
       'text_ner', 'text_cleaned', 'text_lemm', 'title_ner', 'title_cleaned',
       'title_lemm', 'title_word_count', 'text_word_count', 'vader_sent',
       'vader_comp'],
      dtype='object')
In [9]:
news.sample(1, random_state = 42)[['text_ner', 'text_cleaned', 'text_lemm', 'title_ner', 'title_cleaned', 'title_lemm']]
Out[9]:
text_ner text_cleaned text_lemm title_ner title_cleaned title_lemm
196666 Prosecutors in all states urge Congress to strengthen tools to fight AI child sexual abuse images Skip to contentCommunity Coverage TourHome ProMedically SpeakingBest of the WestChampions in AgBack to Our AppsCOVID 19Food for NewsTexasNew to a TipLatest CamsClosings and DelaysSend Us Your Weather PhotosTxDOT Highway ConditionsDownload the Weather AppWeather ResourcesKCBD InvestigatesSubmit a TipChad Read ShootingReagor Dykes CoverageSex Trafficking on the South PlainsLubbock County Medical E... prosecutors states urge congress strengthen tools fight ai child sexual abuse images skip contentcommunity coverage tourhome promedically speakingbest westchampions agback appscovid newstexasnew tiplatest camsclosings delayssend us weather photostxdot highway conditionsdownload weather appweather resourceskcbd investigatessubmit tipchad read shootingreagor dykes coveragesex trafficking south plainslubbock county medical examiner school beat petestats predictionshow watchcommunitytell somethi... prosecutor state urge congress strengthen tool fight ai child sexual abuse image skip contentcommunity coverage tourhome promedically speakingbest westchampions agback appscovid newstexasnew tiplatest camsclosings delayssend u weather photostxdot highway conditionsdownload weather appweather resourceskcbd investigatessubmit tipchad read shootingreagor dyke coveragesex traffic south plainslubbock county medical examiner school beat petestats predictionshow watchcommunitytell something goodnot... Prosecutors in all states urge Congress to strengthen tools to fight AI child sexual abuse images prosecutors states urge congress strengthen tools fight ai child sexual abuse images prosecutor state urge congress strengthen tool fight ai child sexual abuse image

2. Sentiment Topic Modeling: BERT¶

Topics can be extracted either with LDA (for example, via gensim or ktrain) or with BERTopic. The three options are compared below; this notebook uses BERTopic.

BERTopic¶

  • Nature: BERTopic leverages transformer-based models, like BERT, for generating document embeddings, which capture the contextual relationships between words in a text.
  • Methodology: It uses dimensionality reduction (usually UMAP) and clustering algorithms (like HDBSCAN) on top of the embeddings to find topics.
  • Advantages: BERTopic excels in capturing the semantic meaning of texts, offering more nuanced and contextually relevant topics.
  • Use Cases: It is well-suited for advanced topic modeling tasks where deep contextual understanding is crucial.
  • Computational Requirements: Embedding every document with a transformer model is computationally intensive. In this notebook, fitting 9,947 documents took about 6 minutes of wall time (about 39 minutes of CPU time).
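
After clustering, BERTopic describes each cluster with its highest-scoring words using a class-based TF-IDF (c-TF-IDF). The following is a minimal pure-Python sketch of that scoring step, assuming the formula given in the BERTopic documentation: the score of term t in class c is tf(t, c) * log(1 + A / f(t)), where A is the average number of words per class and f(t) is the total frequency of t across classes. The function name `c_tf_idf` and the toy documents are hypothetical, and the library's own implementation differs in detail.

```python
import math
from collections import Counter

def c_tf_idf(docs_by_topic):
    """Score words per topic with class-based TF-IDF (illustrative only).

    docs_by_topic maps a topic id to the list of tokenized documents
    assigned to that topic.
    """
    # Treat all documents in a topic as one long "class document".
    class_counts = {
        topic: Counter(tok for doc in docs for tok in doc)
        for topic, docs in docs_by_topic.items()
    }
    # Total frequency of each term across every class.
    total_counts = Counter()
    for counts in class_counts.values():
        total_counts.update(counts)
    # Average number of words per class.
    avg_words = sum(total_counts.values()) / len(class_counts)

    scores = {}
    for topic, counts in class_counts.items():
        class_size = sum(counts.values())
        scores[topic] = {
            term: (freq / class_size) * math.log(1 + avg_words / total_counts[term])
            for term, freq in counts.items()
        }
    return scores

# Hypothetical toy corpus: two clusters of already-tokenized documents.
toy = {
    0: [["student", "cheat", "essay", "ai"], ["student", "school", "ai"]],
    1: [["cancer", "patient", "ai"], ["cancer", "tumor", "scan", "ai"]],
}
scores = c_tf_idf(toy)
top_words = {topic: max(s, key=s.get) for topic, s in scores.items()}
print(top_words)
```

In this toy corpus, "ai" is frequent in both clusters, so it is down-weighted relative to the words that set each cluster apart.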

LDA in Gensim¶

  • Nature: This is a traditional topic modeling approach that assumes each document is a mixture of topics and each topic is a mixture of words.
  • Methodology: It uses statistical methods to infer the latent topics in a corpus.
  • Advantages: LDA in Gensim is well-established, easy to implement, and less resource-intensive compared to neural network approaches.
  • Use Cases: Suitable for basic topic modeling needs where the primary goal is to identify broad topics within a large volume of text.
  • Computational Requirements: Can be run efficiently on standard CPU setups.
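
The mixture assumption can be made concrete with a small worked example: under LDA, the probability of a word in a document is the sum over topics of P(topic | document) * P(word | topic). The sketch below uses hypothetical topic names and probabilities chosen only for illustration; it does not fit a model.

```python
# Hypothetical topic-word distributions (each sums to 1 over the vocabulary).
topic_word = {
    "education": {"student": 0.5, "school": 0.4, "cancer": 0.05, "patient": 0.05},
    "health":    {"student": 0.05, "school": 0.05, "cancer": 0.5, "patient": 0.4},
}
# Hypothetical document-topic mixture for one article (sums to 1).
doc_topic = {"education": 0.75, "health": 0.25}

def word_probability(word, doc_topic, topic_word):
    """P(word | doc) = sum over topics of P(topic | doc) * P(word | topic)."""
    return sum(weight * topic_word[topic].get(word, 0.0)
               for topic, weight in doc_topic.items())

doc_word = {w: word_probability(w, doc_topic, topic_word)
            for w in topic_word["education"]}
print(doc_word)
```

Because both inputs are proper distributions, the resulting word probabilities for the document also sum to 1.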

LDA in ktrain¶

  • Nature: ktrain, a wrapper for TensorFlow Keras, simplifies machine learning workflows. Its LDA implementation is similar to Gensim's but integrated within the ktrain ecosystem.
  • Methodology: Utilizes statistical methods for topic modeling, akin to Gensim's LDA.
  • Advantages: It provides a more user-friendly interface and integrates well with other ktrain functionalities for end-to-end machine learning tasks.
  • Use Cases: Ideal for users who prefer a streamlined process for topic modeling along with other machine learning tasks, especially in a Keras/TensorFlow environment.
  • Computational Requirements: Comparable to Gensim's LDA in terms of resource needs.

Summary¶

  1. BERTopic: Best for deep contextual understanding and advanced topic modeling, but resource-intensive.
  2. LDA in Gensim: A standard, widely-used method for topic modeling, balancing performance and computational efficiency.
  3. LDA in ktrain: Offers a more accessible and integrated approach within the ktrain framework, suitable for those working within a Keras/TensorFlow environment.
In [10]:
%%time
news['text_tokens'] = news['text_lemm'].parallel_apply(nltk.word_tokenize)
CPU times: user 29.2 s, sys: 17.4 s, total: 46.6 s
Wall time: 2min 1s

2.2. BERTopic on Negative Topics¶

In [11]:
# Copy the filtered rows so that columns added later do not trigger SettingWithCopyWarning
news_ne = news[news['vader_sent'] == 'negative'].copy()
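
This section does not show how the `vader_sent` label was derived from `vader_comp`. A common convention, and an assumption here, is the cutoff suggested in the VADER documentation: a compound score of 0.05 or higher is positive, -0.05 or lower is negative, and anything in between is neutral. The helper name `label_from_compound` is hypothetical.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label.

    The 0.05 threshold is the conventional default, assumed here;
    the notebook may have used a different cutoff.
    """
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"

# Compound scores of the three negative rows sampled later in this notebook.
for score in (-0.9840, -0.1725, -0.2247):
    print(score, label_from_compound(score))
```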
In [12]:
news_ne.info()
<class 'pandas.core.frame.DataFrame'>
Index: 9947 entries, 40 to 198047
Data columns (total 19 columns):
 #   Column            Non-Null Count  Dtype         
---  ------            --------------  -----         
 0   url               9947 non-null   object        
 1   date              9947 non-null   datetime64[ns]
 2   language          9947 non-null   object        
 3   title             9947 non-null   object        
 4   text              9947 non-null   object        
 5   year              9947 non-null   int32         
 6   month             9947 non-null   int32         
 7   day               9947 non-null   int32         
 8   text_ner          9947 non-null   object        
 9   text_cleaned      9947 non-null   object        
 10  text_lemm         9947 non-null   object        
 11  title_ner         9947 non-null   object        
 12  title_cleaned     9947 non-null   object        
 13  title_lemm        9947 non-null   object        
 14  title_word_count  9947 non-null   int64         
 15  text_word_count   9947 non-null   int64         
 16  vader_sent        9947 non-null   object        
 17  vader_comp        9947 non-null   float64       
 18  text_tokens       9947 non-null   object        
dtypes: datetime64[ns](1), float64(1), int32(3), int64(2), object(12)
memory usage: 1.4+ MB
In [13]:
%%time
mod_BERT_neg = BERTopic(calculate_probabilities=True, verbose=True)
topics_neg, probabilities_neg = mod_BERT_neg.fit_transform(news_ne['text_lemm'].tolist())
2023-12-02 02:11:03,498 - BERTopic - Embedding - Transforming documents to embeddings.
2023-12-02 02:15:29,466 - BERTopic - Embedding - Completed ✓
2023-12-02 02:15:29,467 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm
2023-12-02 02:16:10,373 - BERTopic - Dimensionality - Completed ✓
2023-12-02 02:16:10,374 - BERTopic - Cluster - Start clustering the reduced embeddings
2023-12-02 02:16:42,781 - BERTopic - Cluster - Completed ✓
2023-12-02 02:16:42,789 - BERTopic - Representation - Extracting topics from clusters using representation models.
2023-12-02 02:16:55,952 - BERTopic - Representation - Completed ✓
CPU times: user 33min 19s, sys: 6min 6s, total: 39min 25s
Wall time: 5min 59s
In [14]:
mod_BERT_neg.get_topic_info().head(20)
Out[14]:
Topic Count Name Representation Representative_Docs
0 -1 2341 -1_ai_news_use_say [ai, news, use, say, new, video, chatgpt, day, technology, make] [google ai system answer negative question vladimir putin ask russian make argument trump racist daily mail online home showbiz femail royal health science sport politics money video travel shop late headline nasa apple twitter game profile logout login privacy policy feedback thursday sep day forecast advertisement sophie turner sue lie ex joe jonas bid move kid back forever home uk accuse withholding passport refuse let travel bitter divorce oklahoma death row inmate execute lethal injecti...
1 0 200 0_ago_hour_newsfeed_story [ago, hour, newsfeed, story, weather, video, day, top, app, bestreviews] [bad deal good name ai rise sextortion skip content wilkes barre sign wilkes barre sponsor toggle menu open navigation close navigation search please enter search term primary menu news email newsletter signup local news crime court team submit team tip traffic information instapoll agnes fifty look back flood national news local election headquarters healthbeat veteran voice veteran view newsmakers eyewitness history week pennsylvania politics hill automotive news press release top story cr...
2 1 156 1_student_chatgpt_cheat_school [student, chatgpt, cheat, school, teacher, university, assignment, write, essay, professor] [seattle ban student use chatgpt doom come next skip main content turn refresh currently reading seattle ban student use chatgpt doom come next subscribe subscribe edition sign capital regionbest capital firsthudson valleysportshigh race empirepro sportsnew yorkstate calendarart exhibitsmovies tvmusic concertstheater dancewere see worldfood drinktable hoppinglife valuesspecial reportsreal estatefor salefor rentvirtual fairplace adclassifiedssearch classifiedsplace classify adlegal noticespla...
3 2 123 2_stereotype_porn_star_deepfake [stereotype, porn, star, deepfake, image, people, sex, less, may, robot] [deepfake porn could grow problem amid ai race fox skip content fox indianapolis live fmr pres trump speaks nra sign indianapolis live sponsor toggle menu open navigation close navigation search please enter search term primary menu news indiana news crimetracker video demand investigates crime mapping digital exclusive black history month ibj medium inside indiana business living healthy newsnation national world focus health bestreviews bestreviews daily deal local election headquarters po...
4 3 111 3_breast_cancer_mammogram_radiologist [breast, cancer, mammogram, radiologist, patient, screen, woman, doctor, detect, study] [could ai beat human spot breast tumour advertisement health breast cancer news medical lifestyle expert diet nutrition medical scheme beyond beauty home medical breast cancer news january could ai beat human spot breast tumour use ai spot breast tumour could next big thing replace human time soon machine train outperform human come catch breast tumour mammogram new study google several university work artificial intelligence ai model aim improve accuracy mammography screen january issue nat...
5 4 102 4_gebru_google_mitchell_fire [gebru, google, mitchell, fire, timnit, employee, departure, ethic, paper, company] [google ai team demand oust black researcher rehired promote kuaf skip main content site menu donate menu contact u internship newsletter staff station air web comm calendar daily weekly schedule podcasts r feed program way listen member volunteer become member issue challenge leave legacy membership faq sustain membership vehicle donation program volunteer search social medium facebook instagram linkedin twitter submit psa story underwriting sponsorship menu contact u internship newsletter ...
6 5 99 5_destefano_daughter_clone_voice [destefano, daughter, clone, voice, mom, phone, scam, mayo, kambhampati, go] [get daughter mom warns terrify ai voice clone scam fake kidnapping skip contentnewsfirst alert pickflint water vaultfirst alert day zone forecastradarday plannerscurrent conditionsweather camsclosings sportsfriday night lightssaginaw lifestyle showaging stylebest brightestbetter dealsexcellence educationjobspet vaultsubmit photo usnextgen tvmeet teamdownload news weather newscastsvuit scheduleinvestigate tvpower nationpress releasescircle country music lifestyle get daughter mom warns terri...
7 6 97 6_italy_italian_openai_chatgpt [italy, italian, openai, chatgpt, data, watchdog, protection, user, privacy, ban] [italy temporarily block chatgpt privacy concern today bc searchhome newsletter subscribe subscribe login logout support centre puzzle contest news covid politics national politics world news sport cannabis travel podcasts video opinion classified job business entertainment life weather obituary contact u contact u black press faq privacy policy term use news politics national cannabis travel obituary classified contact u subscribe login puzzle contest podcasts video life opinion trend black...
8 7 89 7_market_etf_stock_bank [market, etf, stock, bank, rate, number, canada, bloomberg, inflation, fed] [google fall behind ai arm race senior engineer warns bnn bloomberg market index currency energy metal number number data number number number number data number number market market market index currency energy metal number number data number number number number data number number currency stock formatprefix formatnetchange look stock try one result bnn look stock try one result live video show market call market invest personal finance real estate company news commodity economics politics...
9 8 82 8_cancer_tumor_patient_lung [cancer, tumor, patient, lung, cell, treatment, disease, diagnosis, study, brain] [new clue ai cancer prognosis main section home u upcoming event r donate contact u advertise category alternative therapy blood heart circulation bone muscle brain nerve cancer child health cosmetic surgery digestive system disorder condition drug approval trial ear nose throat environmental health eye vision female reproductive genetics birth defect geriatrics age health informatics hematology immune system infection kidney urinary system legal regulatory life style fitness lung breathing ...
10 9 80 9_china_chinese_beijing_taiwan [china, chinese, beijing, taiwan, military, wray, superpower, kai, fu, epub] [world artificial intelligence conference kick china shanghai news breaksearch location channel topic people insign channelsadd useprivacy policydo sell infohelp centerabout particle article shanghai ai waic aiyou may also likenews breakartificial intelligencenews useprivacy policydo sell infohelp centerabout particle intechnologyworld artificial intelligence conference kick china day agoshanghai july xinhua world artificial intelligence conference waic kick china shanghai thursday theme int...
11 10 78 10_bing_microsoft_chatbot_chat [bing, microsoft, chatbot, chat, user, engine, hitler, search, question, insult] [microsoft bing ai chatbot improve company say skip navigation share facebook share twitter share sm share email navigation news back local near nation world health life money entertainment feature late news story new legislation seek give late judge constance baker motley congressional gold medal lawyer ex memphis cop connecticut charge tyre nichols death call continued civility hearing weather back forecast radar hourly day map traffic closing delay late weather story forecast linger showe...
12 11 76 11_fire_smoke_wildfire_camera [fire, smoke, wildfire, camera, satellite, ororatech, firefighter, blaze, selegue, detect] [threat wildfire rise new artificial intelligence solution fight skip contentsend late uspick pethomewatch livenewssend turncraig listmilitary mattersblast pastnationalmeet teamweatherdelays closing classsportsathlete weekbraggin rightsstats predictionshow watchhealthyour morning checkupask country starsemily community calendarchris man tv dinnersmr foodpick petconteststv scheduleabout usmeet teamcontact usjob openingscircle country music lifestyleprevious newscastsgray dc releasesthe threat...
13 12 74 12_coronavirus_covid_virus_outbreak [coronavirus, covid, virus, outbreak, vaccine, infect, pandemic, disease, health, spread] [artificial intelligence tool accurately predicts infect covid pateints would go develop severe respiratory disease latestly live break news seven people evacuate iran test positive covid coronavirus outbreak live news update march english मर tuesday march late story minute ago greyhound peter rabbit list new tentative release date hollywood movie get postpone due covid seven people evacuate iran test positive covid coronavirus outbreak live news update march artificial intelligence tool acc...
14 13 73 13_bing_microsoft_ago_chatbot [bing, microsoft, ago, chatbot, lie, hitler, search, engine, generate, insult] [bing belligerent microsoft look tame ai chatbot kwkt fox skip content kwkt fox waco sign waco sponsor toggle menu open navigation close navigation search please enter search term primary menu stream newsnation live weather camera view skytracker news local news state news texas governor debate national world news political news politics hill washington dc business news crime press release weird news entertainment news health news coronavirus border report fort hood local election hq automot...
15 14 71 14_actor_writer_hollywood_strike [actor, writer, hollywood, strike, studio, movie, sag, netflix, aftra, film] [hollywood strike inflame claim ai could writer job add france home screen charles iii ukraine sudan tv france live see show news accessibility tv guide topic environment business tech sport culture infographics fight fake sponsor content region france africa middle east america europe asia pacific follow u copyright france right reserve france responsible content external website audience rating certify acpm ojd français english español عربي offline navigation sign newsletter manage privacy...
16 15 67 15_clearview_facial_recognition_enforcement [clearview, facial, recognition, enforcement, law, database, privacy, scrap, shelagh, caller] [face scanner clearview ai aim branch beyond police newsbreaksign arttv seriesbooks dancebehind viral videosperforming artstv musichip healthhealth servicesmental healthdiseases healthcancerfood sportspremier drinkspetsbeauty safetypublic safetyaccidentslaw enforcementtraffic advicefamily rentlabor issuestrouble scienceearth nationsmiddle location channel topic people inabc news news context analysis local statein article clearview ai police scanner ukrainian clearview ai co associate pressy...
17 16 66 16_window_click_opinion_open [window, click, opinion, open, obituary, share, subscribe, california, log, medianews] [ai advance terminator arrive press telegram skip content section subscribe sunday january edition home page close menunews news crime public safety investigative reporting politics health environment business housing job local news local news long beach los angeles los angeles county sport sport high school sport charger ram lakers clipper dodger angel college sport ucla sport usc sport long beach state sport king duck boxing mma soccer thing thing restaurant food drink movie music concert ...
18 17 65 17_bias_black_blackness_racist [bias, black, blackness, racist, system, algorithm, facial, racism, racial, label] [complex intersection ai anti blackness menu home buy black cincy rhythm river virtual career fair news schedule event prize contact u advertise u facebook instagram youtube twitter eeo listen live listen live toggle search search search national intersection ai anti blackness unravel complex problem write shannon dawson publish july share share post share link via copy link copy copy listen live like u facebook follow u twitter feature video close source fg trade getty artificial intelligen...
19 18 65 18_safetylit_aa_crash_injury [safetylit, aa, crash, injury, severity, accident, doi, pdf, citation, bulletin] [safetylit chance traffic collision predict use machine learn home search boolean search thesaurus source author weekly update update bulletin pdf update bulletin web u safetylit aa aa aa safetylit weekly update compile citation summary new article every week r feed help tutorial faq contact u contact info safetylit service search result journal article chance traffic collision predict use machine learn citation goravanakolla int re modern eng technol sci copyright copyright irjmets doi pmid...
In [16]:
# get_topic_info() already returns a DataFrame, so no extra wrapping is needed
negative_topic_df = mod_BERT_neg.get_topic_info()
In [17]:
print(negative_topic_df.shape)
(248, 5)
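
Topic -1 is BERTopic's outlier bucket, so the 248 rows correspond to 247 actual topics. A quick piece of arithmetic with the figures printed above shows how much of the negative corpus was left unassigned; the variable names are illustrative.

```python
# Figures copied from the outputs above.
n_docs = 9947      # negative-sentiment articles passed to fit_transform
n_outliers = 2341  # documents assigned to topic -1
n_rows = 248       # rows in get_topic_info(), including topic -1

n_topics = n_rows - 1  # exclude the outlier row
outlier_share = n_outliers / n_docs
print(f"{n_topics} topics; {outlier_share:.1%} of documents are outliers")
```

Roughly a quarter of the documents fall outside every topic, which is worth keeping in mind when interpreting the topic counts.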
In [18]:
negative_topic_df.head()
Out[18]:
Topic Count Name Representation Representative_Docs
0 -1 2341 -1_ai_news_use_say [ai, news, use, say, new, video, chatgpt, day, technology, make] [google ai system answer negative question vladimir putin ask russian make argument trump racist daily mail online home showbiz femail royal health science sport politics money video travel shop late headline nasa apple twitter game profile logout login privacy policy feedback thursday sep day forecast advertisement sophie turner sue lie ex joe jonas bid move kid back forever home uk accuse withholding passport refuse let travel bitter divorce oklahoma death row inmate execute lethal injecti...
1 0 200 0_ago_hour_newsfeed_story [ago, hour, newsfeed, story, weather, video, day, top, app, bestreviews] [bad deal good name ai rise sextortion skip content wilkes barre sign wilkes barre sponsor toggle menu open navigation close navigation search please enter search term primary menu news email newsletter signup local news crime court team submit team tip traffic information instapoll agnes fifty look back flood national news local election headquarters healthbeat veteran voice veteran view newsmakers eyewitness history week pennsylvania politics hill automotive news press release top story cr...
2 1 156 1_student_chatgpt_cheat_school [student, chatgpt, cheat, school, teacher, university, assignment, write, essay, professor] [seattle ban student use chatgpt doom come next skip main content turn refresh currently reading seattle ban student use chatgpt doom come next subscribe subscribe edition sign capital regionbest capital firsthudson valleysportshigh race empirepro sportsnew yorkstate calendarart exhibitsmovies tvmusic concertstheater dancewere see worldfood drinktable hoppinglife valuesspecial reportsreal estatefor salefor rentvirtual fairplace adclassifiedssearch classifiedsplace classify adlegal noticespla...
3 2 123 2_stereotype_porn_star_deepfake [stereotype, porn, star, deepfake, image, people, sex, less, may, robot] [deepfake porn could grow problem amid ai race fox skip content fox indianapolis live fmr pres trump speaks nra sign indianapolis live sponsor toggle menu open navigation close navigation search please enter search term primary menu news indiana news crimetracker video demand investigates crime mapping digital exclusive black history month ibj medium inside indiana business living healthy newsnation national world focus health bestreviews bestreviews daily deal local election headquarters po...
4 3 111 3_breast_cancer_mammogram_radiologist [breast, cancer, mammogram, radiologist, patient, screen, woman, doctor, detect, study] [could ai beat human spot breast tumour advertisement health breast cancer news medical lifestyle expert diet nutrition medical scheme beyond beauty home medical breast cancer news january could ai beat human spot breast tumour use ai spot breast tumour could next big thing replace human time soon machine train outperform human come catch breast tumour mammogram new study google several university work artificial intelligence ai model aim improve accuracy mammography screen january issue nat...
In [19]:
negative_topic_df.to_parquet('bert_ne_topic_info.parquet')
In [20]:
# Google Cloud Storage details
bucket_name = 'nlp-final'
file_path = 'bert_ne_topic_info.parquet'  # This is the name the file will have in GCS
local_file_path = 'bert_ne_topic_info.parquet'  # Path to the local file you just saved

# Create a GCS Client
storage_client = storage.Client()

# Get the bucket
bucket = storage_client.get_bucket(bucket_name)

# Create a blob object from the filepath
blob = bucket.blob(file_path)

# Upload the file
blob.upload_from_filename(local_file_path)
In [21]:
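# Attach each document's topic id, then that topic's top (word, score) pairs.
# assign() returns a new DataFrame, which avoids writing into a slice of `news`.
news_ne = news_ne.assign(bert_topics=mod_BERT_neg.topics_)
news_ne['bert_topics_words'] = news_ne['bert_topics'].map(mod_BERT_neg.get_topic)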
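
Each entry in `bert_topics_words` is a list of (word, score) pairs, which is awkward to read or filter on. A small helper can reduce it to the words alone. This is a sketch: the name `topic_words` is hypothetical, and the guard for falsy values is an assumption that covers a topic id with no stored representation.

```python
def topic_words(topic_repr, top_n=5):
    """Return the top_n words from a list of (word, score) pairs."""
    if not topic_repr:  # covers None, False, and empty lists
        return []
    return [word for word, _score in topic_repr[:top_n]]

# First entries of the topic 190 representation shown in the sample output.
example = [("pause", 0.019994939651570582),
           ("letter", 0.014984285895357536),
           ("powerful", 0.013900773335435879)]
print(topic_words(example, top_n=2))
```

Applied to the DataFrame, this could look like `news_ne['bert_topics_words'].apply(topic_words)`.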
In [22]:
news_ne.sample(3, random_state = 42)
Out[22]:
url date language title text year month day text_ner text_cleaned text_lemm title_ner title_cleaned title_lemm title_word_count text_word_count vader_sent vader_comp text_tokens bert_topics bert_topics_words
72547 https://www.newschannel10.com/2021/01/18/ex-florida-data-scientist-jail-after-arrest-warrant-issued/ 2021-01-19 en Ex-Florida data scientist turns herself in after arrest warrant issued Ex-Florida data scientist turns herself in after arrest warrant issued \n\n \n\n Skip to content Go Local Grow with Us Expert Connections Health Connections Contests Moms Talk Baby Boomers Talk Panhandle Deals Viewers Choice Awards Home News WATCH LIVE Weather Closings Coronavirus Vaccine Watch Community Sports About Us Home Election Res... 2021 1 19 Ex Florida data scientist turns herself in after arrest warrant issued Skip to content Go Local Grow with Us Expert Connections Health Connections Contests Moms Talk Baby Boomers Talk Panhandle Deals Viewers Choice Awards Home News WATCH LIVE Weather Closings Coronavirus Vaccine Watch Community Sports About Us Home Election Results Download our Apps WATCH LIVE Go Local News National Crime Education Perspective with Brent McClure Good News With Doppler Dave Coronavirus Vaccine Watch Panhandle... ex florida data scientist turns arrest warrant issued skip content go local grow us expert connections health connections contests moms talk baby boomers talk panhandle deals viewers choice awards home news watch live weather closings coronavirus vaccine watch community sports us home election results download apps watch live go local news national crime education perspective brent mcclure good news doppler dave coronavirus vaccine watch panhandle magazine winter summer spring winter fall us... 
ex florida data scientist turn arrest warrant issue skip content go local grow u expert connection health connection contest mom talk baby boomer talk panhandle deal viewer choice award home news watch live weather closing coronavirus vaccine watch community sport u home election result download apps watch live go local news national crime education perspective brent mcclure good news doppler dave coronavirus vaccine watch panhandle magazine winter summer spring winter fall u advertise newsc... Ex Florida data scientist turns herself in after arrest warrant issued ex florida data scientist turns arrest warrant issued ex florida data scientist turn arrest warrant issue 8 613 negative -0.9840 [ex, florida, data, scientist, turn, arrest, warrant, issue, skip, content, go, local, grow, u, expert, connection, health, connection, contest, mom, talk, baby, boomer, talk, panhandle, deal, viewer, choice, award, home, news, watch, live, weather, closing, coronavirus, vaccine, watch, community, sport, u, home, election, result, download, apps, watch, live, go, local, news, national, crime, education, perspective, brent, mcclure, good, news, doppler, dave, coronavirus, vaccine, watch, panh... 75 [(jones, 0.043355561817188616), (florida, 0.04173536174273128), (warrant, 0.02963598659685979), (department, 0.025011008505508917), (tallahassee, 0.020878685891533483), (message, 0.018818827993956955), (nan, 0.01656412064049838), (computer, 0.015099882128683483), (scientist, 0.0141806098929733), (illegally, 0.013642911165518351)]
105006 https://www.mysanantonio.com/news/article/Flood-forecasts-in-real-time-with-block-by-block-17725033.php 2023-01-18 en Flood forecasts in real-time with block-by-block data could save lives – a new machine learning method makes it possible \nFlood forecasts in real-time with block-by-block data could save lives – a new machine learning method makes it possible\n\n \n\n \n \n\n \n\n \n\n \n \n\n \nSkip to main content\n\n\n MySA Homepage\n\nCurrently Reading\nFlood forecasts in real-time with block-by-block data could save lives – a new machine learning method makes it possible\n\nNewsletters\n\nSign In\n\n \nHomeSubscribeBuy E-N MerchandiseContact UsAbout UsAdvertise With UsPlace a Classified AdPrivacy NoticeNewsletters & Tex... 2023 1 18 Flood forecasts in real time with block by block data could save lives a new machine learning method makes it possible Skip to main content MySA Homepage Currently Reading Flood forecasts in real time with block by block data could save lives a new machine learning method makes it possible Newsletters Sign In HomeSubscribeBuy E N MerchandiseContact UsAbout UsAdvertise With UsPlace a Classified AdPrivacy NoticeNewsletters Text AlertsFind a Business in S.A.Manage by to San AntonioClassified Ma... flood forecasts real time block block data could save lives new machine learning method makes possible skip main content mysa homepage currently reading flood forecasts real time block block data could save lives new machine learning method makes possible newsletters sign homesubscribebuy merchandisecontact usabout usadvertise usplace classified adprivacy noticenewsletters text alertsfind business san antonioclassified marketplacetop lawyersnationborder newsreal estatehome searchland salesre... 
flood forecast real time block block data could save life new machine learn method make possible skip main content mysa homepage currently reading flood forecast real time block block data could save life new machine learn method make possible newsletter sign homesubscribebuy merchandisecontact usabout usadvertise usplace classify adprivacy noticenewsletters text alertsfind business san antonioclassified marketplacetop lawyersnationborder newsreal estatehome searchland salesrentalshomes guid... Flood forecasts in real time with block by block data could save lives flood forecasts real time block block data could save lives flood forecast real time block block data could save life 10 964 negative -0.1725 [flood, forecast, real, time, block, block, data, could, save, life, new, machine, learn, method, make, possible, skip, main, content, mysa, homepage, currently, reading, flood, forecast, real, time, block, block, data, could, save, life, new, machine, learn, method, make, possible, newsletter, sign, homesubscribebuy, merchandisecontact, usabout, usadvertise, usplace, classify, adprivacy, noticenewsletters, text, alertsfind, business, san, antonioclassified, marketplacetop, lawyersnationbord... 213 [(tpr, 0.033550158894503856), (delmarva, 0.024728626694621437), (antonio, 0.016485931371684377), (texas, 0.012284491904531835), (flood, 0.01051443818294521), (infoaccurate, 0.010467426060000026), (fm, 0.010389263069446205), (wgmd, 0.01016508094318885), (talk, 0.0098345087259608), (accurate, 0.008830676110647888)]
107850 https://www.knoe.com/2023/03/29/musk-scientists-call-halt-ai-race-sparked-by-chatgpt/ 2023-03-29 en Musk, scientists call for halt to AI race sparked by ChatGPT Musk, scientists call for halt to AI race sparked by ChatGPT\n\nSkip to contentTornado Disaster ReliefNewsWeatherSportsOur TownLivestreamContestsNewsArkansasCOVID-19 InfoWhat's Your StoryNationalRegionalStateNELA Home ShowWeatherWeather MapsRadarWeather BlogWeather AcademyWeather RadioSevere Weather ResourcesClosingsLivestreamSportsLocal ScoresBeat the AceTeam of the WeekAaron's AcesCheerleader ChallengeCommunity CalendarContestsCOVID-19 MapGood Morning ArkLaMissGuest RecipesGuest Interview ... 2023 3 29 Musk, scientists call for halt to AI race sparked by ChatGPT Skip to contentTornado Disaster InfoWhat s Your Home ShowWeatherWeather MapsRadarWeather BlogWeather AcademyWeather RadioSevere Weather ScoresBeat the AceTeam of the WeekAaron s AcesCheerleader ChallengeCommunity MapGood Morning ArkLaMissGuest RecipesGuest Interview Request FormHealth ConnectionsPerfect HomeOur TownService SaluteSubmit Photos and VideosFeed Your SoulRecommend Your Favorite RestaurantMr. FoodTalking FoodTV ListingsS... musk scientists call halt ai race sparked chatgpt skip contenttornado disaster infowhat home showweatherweather mapsradarweather blogweather academyweather radiosevere weather scoresbeat aceteam weekaaron acescheerleader challengecommunity mapgood morning arklamissguest recipesguest interview request formhealth connectionsperfect homeour townservice salutesubmit photos videosfeed soulrecommend favorite restaurantmr foodtalking foodtv listingsstation jobscontact usmeet teamadvertise usjobsclo... 
musk scientist call halt ai race spark chatgpt skip contenttornado disaster infowhat home showweatherweather mapsradarweather blogweather academyweather radiosevere weather scoresbeat aceteam weekaaron acescheerleader challengecommunity mapgood morning arklamissguest recipesguest interview request formhealth connectionsperfect homeour townservice salutesubmit photo videosfeed soulrecommend favorite restaurantmr foodtalking foodtv listingsstation jobscontact usmeet teamadvertise usjobsclosed ... Musk, scientists call for halt to AI race sparked by ChatGPT musk scientists call halt ai race sparked chatgpt musk scientist call halt ai race spark chatgpt 8 533 negative -0.2247 [musk, scientist, call, halt, ai, race, spark, chatgpt, skip, contenttornado, disaster, infowhat, home, showweatherweather, mapsradarweather, blogweather, academyweather, radiosevere, weather, scoresbeat, aceteam, weekaaron, acescheerleader, challengecommunity, mapgood, morning, arklamissguest, recipesguest, interview, request, formhealth, connectionsperfect, homeour, townservice, salutesubmit, photo, videosfeed, soulrecommend, favorite, restaurantmr, foodtalking, foodtv, listingsstation, jo... 190 [(pause, 0.019994939651570582), (letter, 0.014984285895357536), (powerful, 0.013900773335435879), (outsmart, 0.011741687846638798), (musk, 0.011336369915633726), (race, 0.011166180664184238), (openai, 0.010810624582858494), (wozniak, 0.010301499583486142), (generator, 0.009691471170377056), (wrbl, 0.009635537384634872)]

Topic Visualization¶

In [33]:
fig = mod_BERT_neg.visualize_topics()
fig.write_html("bertopic_visualization.html")  # For saving as interactive HTML

fig.show()

Topic Word Scores (Bar Chart)¶

In [64]:
fig = mod_BERT_neg.visualize_barchart()
fig.write_html("topic_frequency.html")

Topic Hierarchy¶

In [65]:
fig = mod_BERT_neg.visualize_hierarchy()
fig.write_html("topic_hierarchy.html")

Topic Similarity¶

In [66]:
fig = mod_BERT_neg.visualize_heatmap()
fig.write_html("topic_similarity.html")

Intertopic Distance Map¶

In [67]:
fig = mod_BERT_neg.visualize_topics()
fig.write_html("intertopic_distance_map.html")
In [25]:
print("Number of topics:", mod_BERT_neg.get_topic_freq().shape[0])
Number of topics: 248
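Note that this count includes the outlier label -1 when it is present: the value-count tables further down list labels -1 through 246, which is 248 rows. A minimal sketch of separating the two counts, using a small hypothetical list of topic assignments rather than the fitted model:

```python
# Hypothetical topic assignments, one per document; BERTopic uses -1 for outliers.
assignments = [-1, 0, 0, 1, -1, 2, 1, 0, -1, 3]

all_labels = set(assignments)       # every distinct label, including -1
real_topics = all_labels - {-1}     # labels that correspond to actual clusters

print("Distinct labels (incl. outliers):", len(all_labels))
print("Topics excluding outliers:", len(real_topics))
```

Applied to the real assignments, the same subtraction would separate the outlier bucket from the clustered topics.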
In [35]:
news_ne.to_csv('gs://nlp-final/news_bert_ne.csv',index=False)

3. Negative Sentiment Analysis Over Time¶

3.1. Understanding the Main Topics¶

1. Topic Distribution¶

In [38]:
news_ne[['bert_topics','bert_topics_words']].sample(3, random_state = 42)
Out[38]:
bert_topics bert_topics_words
72547 75 [(jones, 0.043355561817188616), (florida, 0.04173536174273128), (warrant, 0.02963598659685979), (department, 0.025011008505508917), (tallahassee, 0.020878685891533483), (message, 0.018818827993956955), (nan, 0.01656412064049838), (computer, 0.015099882128683483), (scientist, 0.0141806098929733), (illegally, 0.013642911165518351)]
105006 213 [(tpr, 0.033550158894503856), (delmarva, 0.024728626694621437), (antonio, 0.016485931371684377), (texas, 0.012284491904531835), (flood, 0.01051443818294521), (infoaccurate, 0.010467426060000026), (fm, 0.010389263069446205), (wgmd, 0.01016508094318885), (talk, 0.0098345087259608), (accurate, 0.008830676110647888)]
107850 190 [(pause, 0.019994939651570582), (letter, 0.014984285895357536), (powerful, 0.013900773335435879), (outsmart, 0.011741687846638798), (musk, 0.011336369915633726), (race, 0.011166180664184238), (openai, 0.010810624582858494), (wozniak, 0.010301499583486142), (generator, 0.009691471170377056), (wrbl, 0.009635537384634872)]
In [45]:
news_ne['bert_topics'].value_counts(ascending = False).reset_index(name = 'count')
Out[45]:
bert_topics count
0 -1 2341
1 0 200
2 1 156
3 2 123
4 3 111
... ... ...
243 242 11
244 243 11
245 244 11
246 245 11
247 246 11

248 rows × 2 columns

In [47]:
news_ne['bert_topics'].value_counts(ascending = False, normalize = True).reset_index(name = 'portion')
Out[47]:
bert_topics portion
0 -1 0.235347
1 0 0.020107
2 1 0.015683
3 2 0.012366
4 3 0.011159
... ... ...
243 242 0.001106
244 243 0.001106
245 244 0.001106
246 245 0.001106
247 246 0.001106

248 rows × 2 columns

2. Topic-Related Information: Interpretation¶

  • Topic: The integer identifier of each topic. Topic -1 is special: BERTopic uses it for outliers, that is, documents not assigned to any cluster.
  • Count: The number of documents assigned to the topic. Topics with a high count are more prevalent in the dataset.
  • Name: The topic identifier followed by the topic's top keywords, joined with underscores (for example, 1_student_chatgpt_cheat_school). It gives a quick idea of what the topic is about.
  • Representation: The keywords most characteristic of the topic.
  • Representative_Docs: The documents (or parts of them) most representative of the topic, useful for seeing the context in which the topic keywords appear.
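As a concrete illustration of how these columns relate, the Name values in the table below appear to be the topic id joined with the first four Representation words. The sketch reproduces that pattern on two rows copied (and shortened) from the table. The helper `topic_name` is hypothetical and is not part of BERTopic's API.

```python
# Two rows mimicking the topic table in this notebook (keyword lists shortened).
rows = [
    {"Topic": -1, "Count": 2341,
     "Representation": ["ai", "news", "use", "say", "new"]},
    {"Topic": 1, "Count": 156,
     "Representation": ["student", "chatgpt", "cheat", "school", "teacher"]},
]

def topic_name(row, n_words=4):
    """Hypothetical helper: join the topic id with its first n_words keywords."""
    return "_".join([str(row["Topic"])] + row["Representation"][:n_words])

for row in rows:
    kind = "outliers" if row["Topic"] == -1 else "topic"
    print(f"{topic_name(row):35s} {row['Count']:5d} docs ({kind})")
```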
In [60]:
negative_topic_df.head(10)
Out[60]:
Topic Count Name Representation Representative_Docs
0 -1 2341 -1_ai_news_use_say [ai, news, use, say, new, video, chatgpt, day, technology, make] [google ai system answer negative question vladimir putin ask russian make argument trump racist daily mail online home showbiz femail royal health science sport politics money video travel shop late headline nasa apple twitter game profile logout login privacy policy feedback thursday sep day forecast advertisement sophie turner sue lie ex joe jonas bid move kid back forever home uk accuse withholding passport refuse let travel bitter divorce oklahoma death row inmate execute lethal injecti...
1 0 200 0_ago_hour_newsfeed_story [ago, hour, newsfeed, story, weather, video, day, top, app, bestreviews] [bad deal good name ai rise sextortion skip content wilkes barre sign wilkes barre sponsor toggle menu open navigation close navigation search please enter search term primary menu news email newsletter signup local news crime court team submit team tip traffic information instapoll agnes fifty look back flood national news local election headquarters healthbeat veteran voice veteran view newsmakers eyewitness history week pennsylvania politics hill automotive news press release top story cr...
2 1 156 1_student_chatgpt_cheat_school [student, chatgpt, cheat, school, teacher, university, assignment, write, essay, professor] [seattle ban student use chatgpt doom come next skip main content turn refresh currently reading seattle ban student use chatgpt doom come next subscribe subscribe edition sign capital regionbest capital firsthudson valleysportshigh race empirepro sportsnew yorkstate calendarart exhibitsmovies tvmusic concertstheater dancewere see worldfood drinktable hoppinglife valuesspecial reportsreal estatefor salefor rentvirtual fairplace adclassifiedssearch classifiedsplace classify adlegal noticespla...
3 2 123 2_stereotype_porn_star_deepfake [stereotype, porn, star, deepfake, image, people, sex, less, may, robot] [deepfake porn could grow problem amid ai race fox skip content fox indianapolis live fmr pres trump speaks nra sign indianapolis live sponsor toggle menu open navigation close navigation search please enter search term primary menu news indiana news crimetracker video demand investigates crime mapping digital exclusive black history month ibj medium inside indiana business living healthy newsnation national world focus health bestreviews bestreviews daily deal local election headquarters po...
4 3 111 3_breast_cancer_mammogram_radiologist [breast, cancer, mammogram, radiologist, patient, screen, woman, doctor, detect, study] [could ai beat human spot breast tumour advertisement health breast cancer news medical lifestyle expert diet nutrition medical scheme beyond beauty home medical breast cancer news january could ai beat human spot breast tumour use ai spot breast tumour could next big thing replace human time soon machine train outperform human come catch breast tumour mammogram new study google several university work artificial intelligence ai model aim improve accuracy mammography screen january issue nat...
5 4 102 4_gebru_google_mitchell_fire [gebru, google, mitchell, fire, timnit, employee, departure, ethic, paper, company] [google ai team demand oust black researcher rehired promote kuaf skip main content site menu donate menu contact u internship newsletter staff station air web comm calendar daily weekly schedule podcasts r feed program way listen member volunteer become member issue challenge leave legacy membership faq sustain membership vehicle donation program volunteer search social medium facebook instagram linkedin twitter submit psa story underwriting sponsorship menu contact u internship newsletter ...
6 5 99 5_destefano_daughter_clone_voice [destefano, daughter, clone, voice, mom, phone, scam, mayo, kambhampati, go] [get daughter mom warns terrify ai voice clone scam fake kidnapping skip contentnewsfirst alert pickflint water vaultfirst alert day zone forecastradarday plannerscurrent conditionsweather camsclosings sportsfriday night lightssaginaw lifestyle showaging stylebest brightestbetter dealsexcellence educationjobspet vaultsubmit photo usnextgen tvmeet teamdownload news weather newscastsvuit scheduleinvestigate tvpower nationpress releasescircle country music lifestyle get daughter mom warns terri...
7 6 97 6_italy_italian_openai_chatgpt [italy, italian, openai, chatgpt, data, watchdog, protection, user, privacy, ban] [italy temporarily block chatgpt privacy concern today bc searchhome newsletter subscribe subscribe login logout support centre puzzle contest news covid politics national politics world news sport cannabis travel podcasts video opinion classified job business entertainment life weather obituary contact u contact u black press faq privacy policy term use news politics national cannabis travel obituary classified contact u subscribe login puzzle contest podcasts video life opinion trend black...
8 7 89 7_market_etf_stock_bank [market, etf, stock, bank, rate, number, canada, bloomberg, inflation, fed] [google fall behind ai arm race senior engineer warns bnn bloomberg market index currency energy metal number number data number number number number data number number market market market index currency energy metal number number data number number number number data number number currency stock formatprefix formatnetchange look stock try one result bnn look stock try one result live video show market call market invest personal finance real estate company news commodity economics politics...
9 8 82 8_cancer_tumor_patient_lung [cancer, tumor, patient, lung, cell, treatment, disease, diagnosis, study, brain] [new clue ai cancer prognosis main section home u upcoming event r donate contact u advertise category alternative therapy blood heart circulation bone muscle brain nerve cancer child health cosmetic surgery digestive system disorder condition drug approval trial ear nose throat environmental health eye vision female reproductive genetics birth defect geriatrics age health informatics hematology immune system infection kidney urinary system legal regulatory life style fitness lung breathing ...

3. Word Clouds for Representation and Representative_Docs¶

In [71]:
# Flatten the list of words in each representation into a single string and then join all strings
all_representations = ' '.join([' '.join(repr_list) for repr_list in negative_topic_df['Representation']])

# Create a word cloud
wordcloud_rep = WordCloud(background_color='white').generate(all_representations)

# Plotting
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud_rep, interpolation='bilinear')
plt.axis('off')
plt.show()
[Figure: word cloud of the Representation keywords across all topics]

Representative_Docs (Topics 1 to 10)¶

In [76]:
# Assuming 'Representative_Docs' contains lists of strings
for topic in range(1, 11):
    doc_list = negative_topic_df[negative_topic_df['Topic'] == topic]['Representative_Docs'].iloc[0]
    if isinstance(doc_list, list):
        doc_str = ' '.join(doc_list)  # Join list into a single string
    else:
        doc_str = doc_list  # If it's already a string

    # Generate word cloud
    wordcloud_doc = WordCloud(background_color='white').generate(doc_str)

    # Plotting
    plt.figure(figsize=(10, 5))
    plt.imshow(wordcloud_doc, interpolation='bilinear')
    plt.title(f"Word Cloud for Topic {topic}")
    plt.axis('off')
    plt.show()
[Figures: ten word clouds, one for each of Topics 1 to 10]
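A word cloud scales each word roughly by how often it occurs in the input text, so the underlying counts can be inspected directly. The sketch below does this with `collections.Counter` on a short hypothetical string standing in for one topic's joined documents. `WordCloud` additionally applies its own stopword filtering and bigram detection by default, so its sizes will not match plain counts exactly.

```python
from collections import Counter

# Hypothetical stand-in for one topic's joined Representative_Docs string.
doc_str = "student chatgpt cheat school student essay chatgpt student teacher"

# Count whitespace-separated tokens; these counts drive relative word size.
word_counts = Counter(doc_str.split())

for word, count in word_counts.most_common(3):
    print(word, count)
```

Looking at the counts alongside the image makes it easier to tell whether a large word reflects the topic itself or leftover site-navigation text.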

3.2. Negative Sentiment and Topics Over Time¶

1. Yearly Analysis¶

1. Aggregate Topic Counts Over Time¶

In [81]:
# Count the frequency of each topic
topic_counts = news_ne['bert_topics'].value_counts()

# Remove the outlier topic -1 (if present) and get the top 10 topics
top_10_topics = topic_counts.drop(-1, errors='ignore').nlargest(10).index
In [82]:
# Filter the dataset
filtered_news_ne = news_ne[news_ne['bert_topics'].isin(top_10_topics)]
In [83]:
# Group by year and topic, and count occurrences
topic_trends = filtered_news_ne.groupby(['year', 'bert_topics']).size().reset_index(name='counts')

2. Pivot the Data for Analysis¶

In [84]:
# Pivot the data
topic_trends_pivot = topic_trends.pivot(index='year', columns='bert_topics', values='counts').fillna(0)
In [85]:
topic_trends_pivot.head()
Out[85]:
bert_topics 0 1 2 3 4 5 6 7 8 9
year
2020 16.0 4.0 8.0 36.0 27.0 0.0 0.0 1.0 11.0 11.0
2021 9.0 3.0 18.0 17.0 60.0 0.0 0.0 1.0 12.0 26.0
2022 15.0 8.0 13.0 19.0 15.0 1.0 0.0 6.0 14.0 9.0
2023 160.0 141.0 84.0 39.0 0.0 98.0 97.0 81.0 45.0 34.0
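The group-by and pivot steps can also be written as a single `pd.crosstab` call, which fills missing (year, topic) combinations with 0 and keeps the counts as integers instead of floats. This is an alternative sketch on a hypothetical miniature of `news_ne`, not the code the notebook ran:

```python
import pandas as pd

# Hypothetical miniature of news_ne with only the two columns used here.
df = pd.DataFrame({
    "year": [2020, 2020, 2021, 2023, 2023, 2023],
    "bert_topics": [3, 4, 4, 0, 0, 1],
})

# crosstab counts each (year, topic) pair; absent pairs become 0.
trends = pd.crosstab(df["year"], df["bert_topics"])
print(trends)
```

Years with no rows at all (2022 in this toy data) do not appear in the index, just as with the group-by approach.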

3. Plot the Trends¶

In [86]:
# Plot
plt.figure(figsize=(12, 6))
for topic in topic_trends_pivot.columns:
    plt.plot(topic_trends_pivot.index, topic_trends_pivot[topic], label=f'Topic {topic}')

plt.xlabel('Year')
plt.ylabel('Topic Counts')
plt.title('Top 10 Topic Trends Over Time')
plt.legend()
plt.show()
[Figure: line plot "Top 10 Topic Trends Over Time"; x-axis: Year, y-axis: Topic Counts, one line per topic]
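In the pivot table above, 2023 accounts for 779 of the top-10-topic articles, against 100 to 146 in each earlier year, so raw counts largely reflect the overall growth in volume. One possible complement, which is not part of the original analysis, is to look at each topic's share within its year. A minimal sketch using the counts copied from that table:

```python
import pandas as pd

# Counts copied from the pivot table above (rows: year, columns: topic id).
counts = pd.DataFrame(
    [[16, 4, 8, 36, 27, 0, 0, 1, 11, 11],
     [9, 3, 18, 17, 60, 0, 0, 1, 12, 26],
     [15, 8, 13, 19, 15, 1, 0, 6, 14, 9],
     [160, 141, 84, 39, 0, 98, 97, 81, 45, 34]],
    index=pd.Index([2020, 2021, 2022, 2023], name="year"),
    columns=pd.Index(range(10), name="bert_topics"),
)

# Divide each row by its total so that every year sums to 1.
shares = counts.div(counts.sum(axis=1), axis=0)
print(shares.round(3))
```

These shares are relative to the top 10 topics only. Shares of all negative articles would need the yearly totals from the full dataset.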

4. Detailed Analysis¶

In [88]:
# Example: Print representations of the top N topics
top_topics = topic_trends_pivot.sum().sort_values(ascending=False).head(10).index
for topic in top_topics:
    print(f"Topic {topic}: {negative_topic_df.loc[negative_topic_df['Topic'] == topic, 'Representation'].iloc[0]}")
Topic 0: ['ago', 'hour', 'newsfeed', 'story', 'weather', 'video', 'day', 'top', 'app', 'bestreviews']
Topic 1: ['student', 'chatgpt', 'cheat', 'school', 'teacher', 'university', 'assignment', 'write', 'essay', 'professor']
Topic 2: ['stereotype', 'porn', 'star', 'deepfake', 'image', 'people', 'sex', 'less', 'may', 'robot']
Topic 3: ['breast', 'cancer', 'mammogram', 'radiologist', 'patient', 'screen', 'woman', 'doctor', 'detect', 'study']
Topic 4: ['gebru', 'google', 'mitchell', 'fire', 'timnit', 'employee', 'departure', 'ethic', 'paper', 'company']
Topic 5: ['destefano', 'daughter', 'clone', 'voice', 'mom', 'phone', 'scam', 'mayo', 'kambhampati', 'go']
Topic 6: ['italy', 'italian', 'openai', 'chatgpt', 'data', 'watchdog', 'protection', 'user', 'privacy', 'ban']
Topic 7: ['market', 'etf', 'stock', 'bank', 'rate', 'number', 'canada', 'bloomberg', 'inflation', 'fed']
Topic 8: ['cancer', 'tumor', 'patient', 'lung', 'cell', 'treatment', 'disease', 'diagnosis', 'study', 'brain']
Topic 9: ['china', 'chinese', 'beijing', 'taiwan', 'military', 'wray', 'superpower', 'kai', 'fu', 'epub']
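To connect these keyword lists back to the yearly trends, it can help to note when each topic peaked. The sketch below does this for three topics, using counts copied from the pivot table earlier in this section. The short labels in the comments are informal readings of the keywords, not model output.

```python
# Yearly counts copied from the pivot table in this notebook, for three topics.
yearly_counts = {
    3: {2020: 36, 2021: 17, 2022: 19, 2023: 39},   # breast cancer screening
    4: {2020: 27, 2021: 60, 2022: 15, 2023: 0},    # Gebru / Google
    1: {2020: 4, 2021: 3, 2022: 8, 2023: 141},     # students and ChatGPT
}

# For each topic, pick the year with the largest count.
peaks = {topic: max(by_year, key=by_year.get)
         for topic, by_year in yearly_counts.items()}

for topic, year in peaks.items():
    print(f"Topic {topic}: peak in {year} with {yearly_counts[topic][year]} articles")
```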